New Transcriptomic Biomarkers of 5-Fluorouracil Resistance

The overall response rate to fluoropyrimidine monotherapy in colorectal cancer (CRC) is limited. Transcriptomic datasets of CRC patients treated with 5-fluorouracil (5FU) could assist in the identification of clinically useful biomarkers. In this research, we aimed to analyze transcriptomic cohorts of 5FU-treated cell lines to uncover new predictive biomarker candidates and to validate the strongest hits in 5FU-treated human colorectal cancer samples with available clinical response data. We utilized an in vitro dataset of cancer cell lines treated with 5FU and used the reported area under the dose–response curve values to determine the therapeutic response to 5FU treatment. Mann–Whitney and ROC analyses were performed to identify significant genes. The strongest genes were combined into a single signature using a random forest classifier. The compound 5-fluorouracil was tested in 592 cell lines (294 nonresponders and 298 responders). The validation cohort consisted of 157 patient samples with 5FU monotherapy from three datasets. The three strongest associations with treatment outcome were observed in SHISA4 (AUC = 0.745, p-value = 5.5 × 10−25), SLC38A6 (AUC = 0.725, p-value = 3.1 × 10−21), and LAPTM4A (AUC = 0.723, p-value = 6.4 × 10−21). A random forest model utilizing the top genes reached an AUC value of 0.74 for predicting therapeutic sensitivity. The model correctly identified 83% of the nonresponder and 73% of the responder patients. The cell line cohort is available and the entire human colorectal cohort have been added to the ROCPlot analysis platform. Here, by using in vitro and in vivo data, we present a framework enabling the ranking of future biomarker candidates of 5FU resistance. A future option is to conduct an independent validation of the established predictors of resistance.


Introduction
Globally, colorectal cancer (CRC) accounts for 10% (1.9 million) of the overall cancer incidence and 9.4% (935,000) of deaths caused by cancer. In 2020, it placed third in incidence and second in mortality among all cancer types [1]. In terms of pathogenesis, approximately 70% of the diagnosed cases are sporadic, 25% are familial, and the remaining 5% are inherited [2]. The primary treatment option is surgery, but in advanced stages, systemic chemotherapy is the standard treatment. The adjuvant therapy for resected CRC patients without metastases includes capecitabine, oxaliplatin, 5-fluorouracil (5FU), and leucovorin [3]. Treatment for metastatic patients is based on 5FU, oxaliplatin, and irinotecan, and, depending on the genetic profile, may also involve biological therapies such as those targeting the VEGF or EGFR pathways [4].
Predictive biomarkers can be used to predict the outcome of a therapeutic intervention and can provide information on the expected success rate of a particular therapy, the existence of therapeutic resistance, or the development of a serious adverse reaction. Although biomarkers can contribute to personalized treatment with improved outcomes, the number of predictive biomarkers in colorectal cancer is still limited. One of the few biomarkers that influence therapeutic decision-making is microsatellite instability (MSI). The 5FU monotherapy treatment is not effective in patients with high MSI, but oxaliplatin-based treatment can offer a therapeutic benefit. Another biomarker affecting therapeutic decisions is the presence of KRAS/NRAS mutations, which decrease the efficacy of EGFR therapies [5]. An important predictive biomarker for avoiding serious adverse reactions to 5FU-based treatment is the test for dihydropyrimidine dehydrogenase (DPD) deficiency in patients [6].
Both 5FU and capecitabine are fluoropyrimidines, which are members of the class of agents known as antimetabolites. As their chemical structure shares a number of common traits with the substrate of enzymes essential for DNA synthesis, antimetabolites can disrupt the DNA structure and ultimately lead to tumor cell death. Besides colorectal cancer, fluoropyrimidines are also used to treat breast, head and neck, ovarian, and gastrointestinal tumors, as well as basal cell carcinomas. The anti-tumor effects of fluoropyrimidine analogs are complex and involve three different mechanisms of action: (a) the 5FU metabolite fluorodeoxyuridine monophosphate (FdUMP) inhibits thymidylate synthase, which is essential for thymidine synthesis, (b) fluorouridine triphosphate (FUTP) is incorporated into RNA, and (c) fluorodeoxyuridine triphosphate (FdUTP) is incorporated into DNA resulting in DNA strand breaks [7].
The overall response rate to fluoropyrimidine monotherapy in colorectal cancer is limited and is in the range of 10% to 15% [8]. A higher response rate of up to 45-50% can be achieved with additional oxaliplatin or irinotecan treatment; however, in these cases, toxicity will also increase [9]. The incidence of CRC is rising and 5FU is considered a key drug; therefore, there is an urgent need to be able to identify patients who may benefit from 5FU-based therapies. Transcriptomic datasets profiling tumors of CRC patients treated with 5FU could provide help with making a significant advancement in this area.
In this study, we aimed to analyze transcriptomic cohorts of 5FU-treated cell lines to uncover new predictive biomarker candidates and validate the strongest hits in 5FU-treated colorectal human samples with clinical information and response data. Our overall goal was not only to identify single genes that could serve as biomarkers in patients treated with fluoropyrimidine but also to combine the top candidates into a single predictive tool by employing machine learning.

Features Significant in the In Vitro Dataset
The compound 5-fluorouracil was tested in 907 cell lines, with a minimum screening concentration of 0.125 µM and a maximum concentration of 32 µM. Based on the reported AUDRC values, 294 were categorized as nonresponders (AUDRC range: 0.931-0.991), 298 as responders (AUDRC range: 0.099-0.801), and 315 (AUDRC between 0.931 and 0.801) were excluded from the analysis. The most sensitive solid tumor cell lines were found to be PSN1 (pancreas cancer), CAL148 (breast cancer), and JHU011 (head and neck cancer), whereas the most resistant cell lines were LN18 (glioblastoma), HEC1 (uterus cancer), and ASH3 (cervix cancer) ( Table 1).
All available genes (n = 15,791) were tested by the Mann-Whitney U test and ROC analysis, and we found statistically significant differences between nonresponders and responders in 2484 genes. The strongest associations with treatment outcome were observed in SHISA4 (shisa family member 4 gene, n = 592, ROC AUC: 0.745, Mann-Whitney U test p-value: 5.5 × 10 −25 ), SLC38A6 (solute carrier family 38 member 6 gene, ROC AUC = 0.725, p-value = 3.1 × 10 −21 ), and LAPTM4A (lysosomal protein transmembrane 4 alpha gene, ROC AUC = 0.721, p-value = 1.2 × 10 −20 ). The top ten most significant genes are presented in Table 2, and the complete list of all genes is provided in Supplemental Table S1.

Clinical Sample Database Construction
Our search criteria were met by a total of 805 CRC patients' samples from 12 datasets with available treatment information including response and raw gene expression data ( Figure 1B). The detailed characteristics of all screened datasets are presented in Table 3. After normalization of the raw gene expression data and standardization between the different measurement platforms ( Figure 1C), mRNA expression for a total of 19,890 genes was available. Out of 805 patients, we filtered those who had received multiple treatments: 180 patients had received bevacizumab, 221 had received irinotecan, and 438 oxaliplatin. Note that some patients had received multiple combinations of these; thus, the number of samples remaining-of patients who had received monotherapies of either 5FU or capecitabine-was reduced to 157. As our goal was to directly link the in vitro data (which was based on 5FU monotherapy) and the clinical samples, we used only these 157 samples as the validation cohort in the present study. However, the entire cohort is available for further research at the www.rocplot.com/colorectal website.

Validation of Significant Genes in the Clinical Sample Database
Of a total of 2484 genes previously identified in the in vitro database, 742 genes reached a statistically significant association in the clinical samples as well. The complete list of all significant genes is presented in Supplemental Table S2. The radar chart of AUC values and gene expression boxplots in resistant and sensitive samples of the strongest genes are presented in Figure 2.

Functional Annotation of Validated Genes
Significant genes were further examined by gene ontology biological process and molecular function overrepresentation analysis. The most significant terms were response to oxidative stress (35 genes, adjusted p-value: 3.90 × 10 −2 ) for biological process and GTPase activity (28 genes, adjusted p-value: 9.79 × 10 −3 ) for molecular function. The strongest significant GO categories are presented in Table 4.

Machine Learning Approach to Set Up an Integrated Predictive Tool
In the machine learning analysis, we used the entire combined database by employing the in vitro samples as the training set (total n = 592, nonresponder n = 294, responder n = 298), and the 5FU/capecitabine treated human samples as the test set (total n = 157, nonresponder n = 80, responder n = 77). A total of 12 genes remained in the final model and the genes that had the highest influence on the performance of the model were among the top ten genes identified in the in vitro databases, such as SHISA4, SLC38A6, PRPF38B, and LAPTM4A. Other genes with a significant effect on performance and overexpressed in the resistant phenotype include LPP and FOXF2, whereas genes overexpressed in the sensitive group were RPL3, MAP2K7, PCF11, INTS7, and TAGAP ( Figure 3).

Discussion
In our study, we had two major goals. On the one hand, since the backbone of systemic chemotherapy treatment in colorectal cancer is based on fluoropyrimidines, we first attempted to identify a set of genes that have a potential role in the chemoresistance against these agents. On the other hand, we developed a machine learning-based predictor for the identification of sensitive and resistant colorectal tumors using gene expression data.
Some information is already available for the best-performing genes involved in chemoresistance identified in the in vitro database. SHISA4 is a member of the Wnt pathway that has been previously linked to 5FU resistance in colorectal cancer [19,20]. The gene SLC38A6 is involved in glutamine transport and the silencing of the gene resulted in the inhibition of cell cycle progression and cell viability in a hepatocellular cancer cell line [21]. The overexpression of LAPTM4A negatively regulates the function of human organic cation transporter 2 [22] and overexpression of the transporter is associated with improved survival in colorectal patients [23]. Remarkably, we have found that all the most important genes determined by the random forest model were also significant in our integrated database.
A strength of our study is the independence of the in vitro training set and the test set of patient samples. We have to emphasize that here we directly linked cell line data which were based on 5FU monotherapy to clinical samples with fluoropyrimidine monotherapy. To our knowledge, this is the first such study with clinically meaningful sample numbers and with a sole focus on fluoropyrimidine monotherapy.
The robustness of our results is also supported by the fact that multiple significant genes of the most important gene ontology categories also contain several markers that have been previously linked to chemoresistance. For example, silencing the NFE2L2 gene increased 5FU sensitization in hepatocellular carcinoma cells [24], RTN4 knockdown promoted higher cytotoxic events in paclitaxel-treated cancer cells [25], and RND3 overexpression promoted drug resistance in gastric cell lines [26].
We have to note a limitation of our study. Although the sensitivity of the random forest model for identifying non-responder patients was 0.83, the achieved specificity was only 0.66. In other words, while the model had a high power for predicting resistance, this did not mean that the alternative estimate automatically confersi therapeutic sensitivity. A future study with an improved prediction model based on higher sample numbers could strengthen the classification sensitivity as well. Secondly, we have to note that we used tertiles of the AUDRC values for classification so that there were no cell lines with similar resistance values. In the nonresponder cell lines, the dose-response curves were markedly shifted to the right. While there could be a difference in response among these cells, the clinical utility of such high concentrations is negligible.
In summary, we identified a set of genes related to resistance against fluoropyrimidines by using gene expression profiles from cell lines and 5FU-and capecitabine-treated patients. Our results will help to rank 5FU biomarker candidates in future studies. We also constructed a machine learning model capable of linking the in vitro and in vivo models, which was able to correctly identify a high proportion of non-responder patients. By utilizing gene expression data from the primary tumor, the random forest model could be used to help therapeutic decision-making.

In Vitro Biomarker Discovery
To investigate the genes which may potentially influence the efficacy of 5FU therapy in vitro, we used the Genomics of Drug Sensitivity in Cancer portal (GDSC) version 1 drug screening database as a discovery dataset [27]. The preprocessing and normalization steps for generating the gene expression data tables were executed as described previously [28].
In the in vitro dataset, we used the reported area under the dose-response curve (AUDRC) values to determine the therapeutic response in 5FU-treated cell lines. We defined a cell line as a responder to the treatment if the reported AUDRC value was in the lower tertile considering all 5FU-treated cell lines. Cell lines with an AUDRC value in the upper tertile were categorized as nonresponders. Cell lines with AUDRC values in the middle tertile were excluded from the analysis ( Figure 1A). The IC50 values for the cell lines are provided in Supplemental Figure S1.
Gene expression across all genes among nonresponders and responders was compared using Mann-Whitney and receiver operating characteristic (ROC) tests in the R statistical environment (https://www.r-project.org/, accessed on 30 October 2022) using Bioconductor libraries (www.bioconductor.org, accessed on 30 October 2022). The significance cutoff for p-values was set at p < 0.01. The false discovery rate (FDR) was calculated using the web application www.multipletesting.com (accessed on 30 October 2022) [29], and only results with an FDR of less than 1% were accepted as significant.
The 'clusterProfiler version 4.0.5' R package [30] was used to perform the functional gene annotations of the new biomarker candidates.

Validation in a Dataset Comprising Clinical Samples
We searched the GEO (https://www.ncbi.nlm.nih.gov/geo/, accessed on 30 October 2022) and GDC (https://portal.gdc.cancer.gov/, accessed on 30 October 2022) portals to identify datasets suitable for the analysis. During this search, the keywords "colorectal", "cancer", "treatment", "response", and "survival" were used. We considered only those publications which included raw gene expression data, information on clinical treatment, response to the treatment or survival information, and involved at least ten patients.
The raw gene expression data were processed according to standard practices. For the samples measured by Affymetrix gene arrays, normalization was performed with the affy package [31], for the samples with Agilent-based chip, pre-processing was performed with the limma package [32], and samples measured by Illumina HiSeq data were processed using DESeq2 [33]. The integrated complete gene expression dataset consisting of samples from all three platforms was quantile normalized. Finally, a scaling normalization was applied to set the mean expression across all genes in each sample to 1000.
Since most samples were measured with the Affymetrix GPL570 platform, we used this platform as the basis for gene mapping between different platforms. Samples measured by other platforms were mapped to the GPL570 platform via the official gene symbols. We used JetSet [34] to select the most reliable probe set for genes measured by multiple probes. In cases where multiple probes matched a single gene, we selected the probe with the highest interquartile range [35].
Clinical samples were divided into responder and nonresponder groups according to the authors' categorization using the clinical annotation of the source dataset. If the outcome of therapeutic response presented four classes, they were combined into two categories: tumors with "progression" and "stable disease" were categorized as nonresponders, whereas samples with "partial response" or "complete response" were categorized as responders. In the clinical cohort, all genes significant in the training set were analyzed for correlation with resistance.

Machine Learning Approach to Set Up an Integrated Predictive Tool
We aimed to combine all available genes into a single predictive algorithm. For this, we had to combine the gene expressions of cell lines and human samples into a single database. As the gene expression dynamic ranges yielded a marked difference, we performed an additional normalization using the combat command of the SVA (version: 3.40.0) R package [36]. As for the analysis of the single genes, in the model-building process, the in vitro samples were assigned to the training set and the clinical samples to the test set. For the analysis, we retained only genes upregulated in nonresponder samples if the p-value in the Mann-Whitney test was below 0.01 and the FDR was below 1%. Next, we applied the boruta R package as a second feature selection method to detect the most important significant genes, which were then combined into a single predictor using a random forest algorithm. The randomForest (version: 4.6.14) [37] and caret (version: 6.0.90) [38] R packages were used in the analysis. To evaluate the overall performance of the final model, we performed a receiver operating characteristic analysis and determined the area under the curve (AUC).